Adam Harwood

Last updated on 2 November 2023

Adam Harwood is Research Data & Digital Preservation Technologist at the University of Sussex


I talk about digital preservation with IT colleagues regularly, and I often feel I am in a position of perceived weakness - lecturing them on technical matters relating to preserving data. Needing a little confidence boost, I asked ChatGPT what it had to say about the difference between backup and digital preservation. I posted an interesting outcome of this conversation on the Digital Preservation JiscMail list. Its answers boosted my confidence, and I no longer feel those nagging pangs that I am talking nonsense during technical conversations.

Bouncing off ChatGPT like this made me hungry for more! How could I harness the capabilities of Generative AI (GenAI) to enhance our digital preservation workflows, or to improve the way I communicate the resources we need to run the digital preservation service? I delved into the literature, reading interesting articles on current debates and future perspectives, and on a project undertaken by The National Archives to research the feasibility of using AI tools to assist archivists in selecting digital records. This last paper was discussed at a DPC Reading Club session which I sadly missed. The project was very interesting, and while I can’t claim to understand the technologies involved, I agreed with the part of its conclusion that states there is a need to educate archive staff in the uses of Machine Learning. The paper was written in 2021, and I’d be interested to hear whether this work has progressed since then, especially with the advent of GenAI tools like ChatGPT and Google Bard.

GenAI offers exciting potential for individual practitioners, even those with limited resources, to explore its application in digital preservation without significant technical overheads. This is still a work in progress, but here are some ways I envision using ChatGPT and Google Bard to benefit our archive service:

Writing code

A colleague of mine recently asked for help with a regular expression to bulk edit URLs in a spreadsheet. It occurred to me that ChatGPT could help with this. I explained the requirements to it, and it was able to offer some Python code that would do the job. It didn’t work at first, but through further conversation, in which I explained what the code actually did and clarified my instructions, it was able to refine the expression. My colleague managed to sort out the syntax before we got a correct answer from ChatGPT, but I felt we were very close, and would have got there if we could have provided it with the right instructions. I think this demonstrates that the more coding skills you have, the better the questions you can ask of the AI, and the better the answers you will get. I’m still hoping I get a place on the DPC’s Python Study Group program!
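As a rough illustration of the kind of task described above (this is not the code ChatGPT produced, and the actual editing rule is not recorded here), a regular expression can be applied to one column of a spreadsheet exported as CSV. The host names, the `url` column name and the rewrite rule below are all hypothetical placeholders.

```python
import csv
import io
import re

# Hypothetical rule: URLs on an old host move to a new host over https.
# Both host names are placeholders, not taken from the original task.
OLD_PREFIX = re.compile(r"^http://old\.example\.org/", re.IGNORECASE)
NEW_PREFIX = "https://new.example.org/"


def rewrite_url(url: str) -> str:
    """Replace the old prefix if present; leave any other URL untouched."""
    return OLD_PREFIX.sub(NEW_PREFIX, url.strip())


def rewrite_column(csv_text: str, column: str) -> str:
    """Apply rewrite_url to one named column of CSV text and return new CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = rewrite_url(row[column])
        writer.writerow(row)
    return out.getvalue()


# Small in-memory sample standing in for the exported spreadsheet.
sample = (
    "title,url\n"
    "Report,http://old.example.org/files/1\n"
    "Other,https://elsewhere.example.com/a\n"
)
print(rewrite_column(sample, "url"))
```

Testing the pattern on a handful of sample rows like this, before running it over the whole spreadsheet, is also a useful way to describe to the AI exactly where its suggestion goes wrong.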

Summarising digital files

I’ve heard about ChatGPT’s ability to summarise information. I therefore wanted to see if it could summarise the contents of a digital file in our archive, with a view to creating a rudimentary workflow for adding descriptions to our catalogue records. Sadly, I found that ChatGPT can’t directly process files; instead, I had to copy and paste text from the documents manually, a task that seemed unduly cumbersome. I turned to Google Bard, which claims to be able to read files. Unfortunately, my efforts were stymied because my University restricts access to Google Bard when using university credentials. I tried to use my own personal login, but it wanted me to turn on ‘smart features’ and ‘personalisation settings’, which would mean sharing more of my data with Google’s servers. I couldn’t face going through the tedious, labyrinthine personalisation settings, so I’ve parked this idea for now. I’m cautiously optimistic it will work in practice. The challenge that remains is understanding how to scale up this solution effectively.

Summarising tricky concepts

I mentioned above that I asked ChatGPT about the difference between backup and digital preservation. The response I received was presented in a nice bullet point format. I thought it might help me to summarise one of the articles above, but it told me the article was too long. I could have split it up into chunks, but that seemed too onerous, and would it summarise each chunk in relation to the others? Still, this seems to be one of ChatGPT’s greatest strengths.
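For anyone who does want to try the chunking route, a minimal sketch might look like the following. It splits text on paragraph breaks into pieces under a word limit, and builds a prompt for each piece that carries forward a running summary, which is one possible way to keep each chunk in relation to the others. The function names, the 500-word limit and the prompt wording are all assumptions made for illustration; the sketch does not call any AI service.

```python
def split_into_chunks(text: str, max_words: int = 500) -> list[str]:
    """Group paragraphs into chunks of roughly max_words words each.

    A single paragraph longer than max_words is kept whole in its own chunk.
    """
    chunks, current, count = [], [], 0
    for paragraph in text.split("\n\n"):
        words = len(paragraph.split())
        if words == 0:
            continue  # skip empty paragraphs
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def build_prompt(chunk: str, running_summary: str) -> str:
    """Build a prompt that asks for an updated summary in light of earlier chunks."""
    if running_summary:
        context = f"Summary of the article so far:\n{running_summary}\n\n"
    else:
        context = ""
    return (
        f"{context}"
        "Update the summary as bullet points, taking this next section into account:\n\n"
        f"{chunk}"
    )


# Tiny demonstration with a made-up four-paragraph text.
article = "\n\n".join(["one two three"] * 4)
for piece in split_into_chunks(article, max_words=6):
    print(build_prompt(piece, running_summary=""))
    print("---")
```

Each prompt would still need to be pasted into the chat by hand, with the reply fed back in as the running summary for the next chunk, so this reduces the effort rather than removing it.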

Making recommendations on how to process files

I’d like to be able to feed a DROID report into an AI and see what tools and processes it recommends to ensure the items are preserved. This seems like a stretch, but Google Bard accepts files, so this may be a matter of asking the right questions. Failing this, you could always ask ChatGPT or Google Bard a straight-up question about each file type.
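One way to make the “straight-up question” approach less laborious would be to condense a DROID CSV export into a count of formats first, and ask about that summary rather than about every file. The sketch below is untested against a real export: it assumes the export has `TYPE`, `PUID` and `FORMAT_NAME` columns and that folders are marked with a `TYPE` of `Folder`, so check the column names in your own report. The sample rows and the prompt wording are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_formats(droid_csv_text: str) -> Counter:
    """Count files per (PUID, format name) pair in a DROID CSV export.

    Assumes TYPE, PUID and FORMAT_NAME columns; rows marked as folders are skipped.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(droid_csv_text)):
        if row.get("TYPE") == "Folder":
            continue
        puid = row.get("PUID") or "unidentified"
        name = row.get("FORMAT_NAME") or "unknown format"
        counts[(puid, name)] += 1
    return counts


def build_question(counts: Counter) -> str:
    """Turn the format counts into a single question to put to a chatbot."""
    lines = [f"- {n} x {name} ({puid})" for (puid, name), n in counts.most_common()]
    return (
        "Our archive holds the following file formats. What tools and processes "
        "would you recommend to preserve them?\n" + "\n".join(lines)
    )


# Made-up sample rows standing in for a real DROID export.
sample = (
    "NAME,TYPE,PUID,FORMAT_NAME\n"
    "minutes,Folder,,\n"
    "jan.pdf,File,fmt/276,Acrobat PDF 1.7 - Portable Document Format\n"
    "feb.pdf,File,fmt/276,Acrobat PDF 1.7 - Portable Document Format\n"
    "notes.xyz,File,,\n"
)
print(build_question(count_formats(sample)))
```

Sending only format counts, rather than file names and paths, also avoids sharing potentially sensitive details of the collection with a third-party service.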

With all this good stuff, it is important to be aware of the drawbacks and challenges of using GenAI. One significant concern is the environmental impact. GenAI tools are built on large statistical models trained on huge datasets. These models require vast computational resources, contributing to increased energy consumption and carbon emissions. A single AI model creates nearly five times more emissions throughout its lifetime than the average American car (source). So before we jump headfirst into using AI, should we consider how it will impact our archive’s carbon footprint?

Are there any other practitioners out there dipping their toes into the new GenAI landscape?


Disclaimer:

I used ChatGPT to help me write this post. All the ideas and examples are mine - it just helped me with my grammar! I did note that during the editing process, ChatGPT was inclined to use excessive superlatives when talking about itself - something which I have edited out. Except maybe one!

 

